Add a 4-digit SWAR follow-up to loop_parse_if_eight_digits (clang)#382
Open
fcostaoliveira wants to merge 1 commit into
After the 8-digit SWAR block loop, consume a remaining 4-7 digit run in one read4_to_u32 + parse_four_digits_unrolled step instead of byte-by-byte (reusing the existing 4-digit helpers). The parsed result is identical; this is purely a faster way to consume the same digits.

Gated to clang: on gcc the extra 4-digit check regresses inputs whose remainder is < 4 digits (e.g. the 17-digit fraction of uniform [0,1] -> -3% on 'random'), because the check becomes pure overhead there; clang does not show that.

m8g.metal-24xl (Graviton4), -O3 -march=native, simple_fastfloat_benchmark, from_chars->double, clang 18, base vs patch back-to-back (2 samples): canada.txt +11.7%, mesh.txt +7.4%, random ~flat. No regression.
Exhaustive validation finished: all five |
lemire
approved these changes
Jun 1, 2026
Let us merge once the tests are green.
After the 8-digit SWAR block loop in `loop_parse_if_eight_digits`, a remaining run of 4–7 digits is currently finished one byte at a time by the caller. This adds a single 4-digit SWAR step (reusing the existing `read4_to_u32` / `is_made_of_four_digits_fast` / `parse_four_digits_unrolled` helpers, already used elsewhere in the file) to consume four of those digits at once. The parsed result is identical; it is purely a faster way to consume the same digits.
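For readers unfamiliar with the technique, here is a minimal, self-contained sketch of what a 4-digit SWAR step looks like. It is not the code in this PR or the library's helpers: `load4_le`, `all_four_are_digits`, `four_digits_to_value` and `consume_tail` are hypothetical names chosen for illustration, and the control flow is simplified to a single 4-digit attempt followed by a byte loop.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: assemble 4 bytes in little-endian order on any host,
// so the first character always lands in the lowest byte.
static inline std::uint32_t load4_le(const char *p) {
  const unsigned char *u = reinterpret_cast<const unsigned char *>(p);
  return std::uint32_t(u[0]) | (std::uint32_t(u[1]) << 8) |
         (std::uint32_t(u[2]) << 16) | (std::uint32_t(u[3]) << 24);
}

// True only if all four bytes are ASCII '0'..'9'. Adding 0x46 sets the high
// bit of any byte above '9'; subtracting 0x30 sets it for any byte below '0'.
static inline bool all_four_are_digits(std::uint32_t v) {
  return (((v + 0x46464646u) | (v - 0x30303030u)) & 0x80808080u) == 0;
}

// Convert four ASCII digits (already validated) to their numeric value.
static inline std::uint32_t four_digits_to_value(std::uint32_t v) {
  v -= 0x30303030u;      // each byte is now 0..9
  v = v * 10 + (v >> 8); // byte 0 = 10*d0 + d1, byte 2 = 10*d2 + d3
  return (v & 0xFFu) * 100 + ((v >> 16) & 0xFFu);
}

// Illustrative tail handling: try one 4-digit step, then finish byte by byte.
// Returns the position of the first character that was not consumed.
static inline const char *consume_tail(const char *p, const char *end,
                                       std::uint64_t &acc) {
  if (end - p >= 4) {
    const std::uint32_t v = load4_le(p);
    if (all_four_are_digits(v)) {
      acc = acc * 10000 + four_digits_to_value(v);
      p += 4;
    }
  }
  while (p != end && *p >= '0' && *p <= '9') {
    acc = acc * 10 + std::uint64_t(*p - '0');
    ++p;
  }
  return p;
}
```

In this sketch, calling `consume_tail` on the 8-character input `1234567x` with `acc` starting at 0 takes the 4-digit step once for `1234`, then consumes `5`, `6`, `7` individually, and stops at `x` with `acc == 1234567`. The value is the same as a pure byte loop would produce; only the number of steps differs.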
Benchmark: `m8g.metal-24xl` (Graviton4 / Neoverse V2), `-O3 -march=native`, `simple_fastfloat_benchmark`, `from_chars` → `double`, clang 18, base vs patch measured back-to-back (reproduced across 3 runs):
About the `#if defined(__clang__)` gate
I gated this to clang, and I want to be upfront about why rather than hide it. On gcc the extra 4-digit check regresses inputs whose remainder is shorter than 4 digits (e.g. the 17-digit fraction of uniform `[0,1]`: gcc spends the check but never takes it, ~−3% on `random`), while clang shows no such regression and a large gain on the mid-length fractions in `canada`/`mesh`. Gating keeps the clang win with zero regression on either compiler, but it does introduce a compiler-specific branch.
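To make the "spends the check but never takes it" case concrete, the small model below counts how a run of `n` digits would be split between 8-digit blocks, the optional 4-digit step, and the byte-by-byte tail. It is a length-only illustration that assumes the whole run consists of digits; `DigitSplit` and `split_digit_run` are hypothetical names, not part of the library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of how a digit run of known length is consumed.
struct DigitSplit {
  std::size_t eight_blocks; // iterations of the 8-digit SWAR loop
  std::size_t four_blocks;  // 0 or 1: whether the 4-digit step fires
  std::size_t tail_bytes;   // digits left for the byte-by-byte loop
};

inline DigitSplit split_digit_run(std::size_t n, bool four_digit_step) {
  DigitSplit s{};
  s.eight_blocks = n / 8;
  std::size_t rem = n % 8;
  // The 4-digit step only pays off when at least 4 digits remain.
  if (four_digit_step && rem >= 4) {
    s.four_blocks = 1;
    rem -= 4;
  }
  s.tail_bytes = rem;
  return s;
}
```

Under this model a 17-digit fraction splits into two 8-digit blocks plus a 1-digit tail, so the 4-digit check is evaluated but never succeeds. A 13-digit run, by contrast, splits into one 8-digit block, one 4-digit step, and a 1-digit tail, which is the shape where the extra step saves work.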
I'm happy to do whatever you prefer here: drop the gate (accepting the gcc `random` regression), keep it gated, rework it into a form that helps both, or drop the PR if you'd rather not carry a compiler-specific path. Numbers and reasoning are all here so you can decide.
Correctness
`FASTFLOAT_TEST` (14/14) + supplemental corpus + the randomized `random_string` / `short_random_string` tests pass under clang; an exhaustive `float32` sweep (`FASTFLOAT_EXHAUSTIVE`) is running. Clean C++11/C++20 under `-Werror -Wall -Wextra -Weffc++ -Wconversion -Wsign-conversion`; clang-format clean. No multi-byte reads beyond the existing endian-safe `read4_to_u32`, so big-endian is unaffected.